5. Main Analysis - All
Provide a detailed, well-organized description of your findings, including textual description, graphs, and code. Your focus should be on both the results and the process. Include, as reasonable and relevant, approaches that didn’t work, challenges, the data cleaning process, etc. . The guidelines for the Executive Summary above do NOT apply to exploratory data analysis. Your main concern is designing graphs that reveal patterns and trends. . As noted in Hmk #4, do not use circles, that is: bubbles, pie charts, or polar coordinates. . Use stacked bar charts sparingly. Try grouped bar charts and faceting as alternatives, and only choose stacked bar charts if they truly do a better job than the alternatives for observing patterns.
Zip Code, Neighborhood and Land Value Data - Adam
One of our other main questions in this project was looking at how the wealth was distributed amongst each of the zipcodes in the five boroughs. Using the census data from the superzip dataset and mapping that to the zipcode shapefile document we found, we managed to visualize the wealth of each neighborhood. We will later be able to plot the license applications over this mapping of the wealth distribution.
map <- get_map("New York City", source = "google", maptype = "toner", zoom = 11)
maptype = "toner" is only available with source = "stamen".
resetting to source = "stamen"...
Source : https://maps.googleapis.com/maps/api/staticmap?center=New+York+City&zoom=11&size=640x640&scale=2&maptype=terrain
Source : https://maps.googleapis.com/maps/api/geocode/json?address=New%20York%20City
Source : http://tile.stamen.com/toner/11/601/768.png
Source : http://tile.stamen.com/toner/11/602/768.png
Source : http://tile.stamen.com/toner/11/603/768.png
Source : http://tile.stamen.com/toner/11/604/768.png
Source : http://tile.stamen.com/toner/11/601/769.png
Source : http://tile.stamen.com/toner/11/602/769.png
Source : http://tile.stamen.com/toner/11/603/769.png
Source : http://tile.stamen.com/toner/11/604/769.png
Source : http://tile.stamen.com/toner/11/601/770.png
Source : http://tile.stamen.com/toner/11/602/770.png
Source : http://tile.stamen.com/toner/11/603/770.png
Source : http://tile.stamen.com/toner/11/604/770.png
Source : http://tile.stamen.com/toner/11/601/771.png
Source : http://tile.stamen.com/toner/11/602/771.png
Source : http://tile.stamen.com/toner/11/603/771.png
Source : http://tile.stamen.com/toner/11/604/771.png
ggmap(map, base_layer = ggplot(aes(x = longitude, y = latitude, color = avg), data = census)) + geom_point(size = 3, alpha=0.7) + scale_color_viridis() + ggtitle("Average Salary of each Zipcode in New York")
This map shows all of the zipcodes and we can clearly see that by far the greatest incomes per zipcode are in the Upper East Side and Upper West Side. There appears to be some grey areaa if you look closely at the Upper West Side. This corresponds to about 150 on the color map. Then, midtown and downtown also have high incomes. After that, they all fall towards the bottom of the spectrum. Thus, I think it would be useful to make a map showing only Manhattan, and then the everything excluding Manhattan. This should give us a better idea of what areas outside of Manhattan might see more bar/cafe openings.
year_bxp<-ggplot(sidewalks_nbh, aes(y=SUBMIT_YEAR,x=1))+geom_boxplot()+ggtitle("Boxplot of Submission Years")
year_bar<-ggplot(sidewalks_nbh,aes(x=SUBMIT_YEAR))+geom_bar()+ggtitle("Bar Graph of Submission Years")
grid.arrange(year_bxp, year_bar, ncol=2)
This graph shows all of the data filtered for Manhattan only. We can see that once we get to Harlem and above, the average median income per neighborhood drops heavily. The rest of the city is pretty similar.
brooklyn_map2 <- get_map("Brooklyn, NY", source = "google", zoom = 12, maptype="toner", color="bw")
maptype = "toner" is only available with source = "stamen".
resetting to source = "stamen"...
Source : https://maps.googleapis.com/maps/api/staticmap?center=Brooklyn,+NY&zoom=12&size=640x640&scale=2&maptype=terrain
Source : https://maps.googleapis.com/maps/api/geocode/json?address=Brooklyn%2C%20NY
Source : http://tile.stamen.com/toner/12/1205/1539.png
Source : http://tile.stamen.com/toner/12/1206/1539.png
Source : http://tile.stamen.com/toner/12/1207/1539.png
Source : http://tile.stamen.com/toner/12/1205/1540.png
Source : http://tile.stamen.com/toner/12/1206/1540.png
Source : http://tile.stamen.com/toner/12/1207/1540.png
Source : http://tile.stamen.com/toner/12/1205/1541.png
Source : http://tile.stamen.com/toner/12/1206/1541.png
Source : http://tile.stamen.com/toner/12/1207/1541.png
census_brooklyn <- census %>% dplyr:: filter(city.x == 'Brooklyn')
ggmap(brooklyn_map2, base_layer = ggplot(aes(x = longitude, y = latitude, color = avg), data = census_brooklyn)) + geom_point(size = 4, alpha=0.7) + scale_color_viridis() + ggtitle("Average Salary of each Zipcode in Brooklyn")
This shows us the average salary throughout the zipcodes in Brooklyn and we can see that it is highest in general when the neighborhood is closer to Manhattan.
At this point, I had realized that it would be hard to make a combined map illustrating both the income data and licenses data by displaying the zips like this. Thus, I found zipcode shapefile data for NYC and imported it in order to make better maps to be used as a base layer for the licensing data. I also begun labelling the neighborhoods using the labels Marika had created.
plotting_zips1<-read.csv("Data/NY_shapefiles.csv")
map2016<-ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2016"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2016")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)
map2016
Here we see that Manhattan has the highest wealth. It has the wealthiest areas but also some areas in Harlem and more uptown that are less wealthy than most neighborhoods in Queens, Brooklyn and the Bronx.
g_all <- ggmap(map) + geom_polygon(aes(fill = avg, x = long, y = lat, group = group), data = plotting_zips1, alpha = 0.8)
g_all + ggtitle("Distribution of Wealth in New York") + scale_fill_viridis()
Here we see a full distribution using the viridis color scheme. This will be used as the base for the overlaid plots later on and we will then filter by each borough.
Wealth by borough - Adam
plotting_zips_manhattan <- plotting_zips1 %>% filter(city.y == "New York")
plotting_zips_brooklyn <- plotting_zips1 %>% filter(city.y == "Brooklyn")
plotting_zips_bronx <- plotting_zips1 %>% filter(city.y == "Bronx")
plotting_zips_queens <- plotting_zips1 %>% filter(!(city.y %in% c("New York", "Brooklyn", "Bronx", "Staten Island")))
#plotting_zips_staten <- plotting_zips1 %>% filter(city.x !%in% c("Manhattan", "Brooklyn", "Bronx", "Queens Village"))
bronx_map <- get_map("Bronx, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
Source : https://maps.googleapis.com/maps/api/staticmap?center=Bronx,+NY&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN
Source : https://maps.googleapis.com/maps/api/geocode/json?address=Bronx%2C%20NY
brooklyn_map <- get_map("Brooklyn, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
Source : https://maps.googleapis.com/maps/api/staticmap?center=Brooklyn,+NY&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN
Source : https://maps.googleapis.com/maps/api/geocode/json?address=Brooklyn%2C%20NY
queens_map <- get_map("Astoria, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
Source : https://maps.googleapis.com/maps/api/staticmap?center=Astoria,+NY&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN
Source : https://maps.googleapis.com/maps/api/geocode/json?address=Astoria%2C%20NY
We are not using Staten Island since we have no sidewalk cafe data in Staten Island
g_manhattan <- ggmap(map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_manhattan, alpha = 0.5)
g_manhattan + ggtitle("Distribution of Wealth in Manhattan") + geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()
This map show us the wealth in Manhattan alone by zipcode. We are now using the viridis color scale so that it is perceptually uniform and easier to see differences in color. This clearly shows that the income is very low at Harlem and above, including Morningside Heights. This is likely due to the fact that it is almost exclusively students living in Morningside Heights and Harlem is cheaper. Also, it is interesting to see that Lower East Side and has a similar income to Harlem. The wealthiest areas are by far Upper West Side and Upper East Side. We can also see that we are missing what seems to be data for two zipcodes in the Upper East side. This corresponds to some of the data that Marika had found was missing and still did not appear in the superzip dataset.
g_brooklyn <- ggmap(brooklyn_map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_brooklyn, alpha = 0.5)
g_brooklyn + ggtitle("Distribution of Wealth in Brooklyn") + geom_text_repel(data=brooklyn_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()
By observing the income distribution in Brooklyn, we see that in general, the areas closer to Manhattan have the highest income such as Brooklyn Heights and Crown Heights.
g_bronx <- ggmap(bronx_map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_bronx, alpha = 0.5)
g_bronx + ggtitle("Distribution of Wealth in Bronx") + geom_text_repel(data=bronx_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()
g_queens <- ggmap(queens_map) + geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_queens, alpha = 0.5)
g_queens + ggtitle("Distribution of Wealth in Queens") + geom_text_repel(data=queens_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3) + scale_fill_viridis()
Surprisingly, it seems that in Queens, some of the higher income areas actually appear further away from Manhattan like in Flushing and Forest Hills and Rego Park. However, there are some higher income areas closer to Manhattan as well like East Elmhurst.
Sidewalk Cafe License Data - Marika
Any business that operates a portion of a restaurant on a public sidewalk must obtain a Sidewalk Cafe License from New Yor City. These licenses must be renewed every two years and fall into three categories: enclosed, unenclosed, or small unenclosed sidewalk cafes.
Analyzing by Borough
First, to help better organize the sidewalk cafe licenses by borough, I added a new column called BOROUGH that is set to MANHATTAN, BROOKLYN, BRONX, or QUEENS. I had to manually check that only the cities in Queens had been called out specifically in the CITY column, so it was easy to distinguish them from BRONX or BROOKLYN.
sidewalks <- sidewalks %>% mutate(BOROUGH = ifelse(CITY=="NEW YORK"|CITY=="New York","MANHATTAN",ifelse(CITY=="BROOKLYN","BROOKLYN",ifelse(CITY=="BRONX","BRONX","QUEENS"))))
To get a better understanding of the distribution of these licenses, I have provided a bar graph by borough.
ggplot(sidewalks, aes(x=fct_infreq(BOROUGH)))+geom_bar(aes(fill=BOROUGH))+ggtitle("Frequency of Sidewalk Cafe Licenses by Borough")+xlab("Borough")+ylab("Frequency")
Clearly Manhattan has the most license requests, followed by Brooklyn, then Queens and finally Bronx. Since at the moment we don’t have neighborhood information (everything in Manhattan is just classified as New York, Brooklyn has only Brooklyn, and Bronx has only the city of Bronx), we can only dive into the Queens data:
queens_cafes <- sidewalks %>% filter(BOROUGH=="QUEENS")
ggplot(queens_cafes, aes(x=fct_infreq(CITY)))+geom_bar(fill="purple")+ggtitle("Frequency of Sidewalk Cafe Licenses in Queens")+xlab("City / Neighborhood")+ylab("Frequency")+coord_flip()
In Queens, a large percentage of license requests come from Astoria, followed by Long Island City and Forest Hills.
Next, in order to do date comparisons to ascertain which are the new applications vs. renewal applications, I had to convert certain date fields from strings (they were read in as string factors) into dates.
sidewalks$EXPIRATION_DATE<-as.Date(sidewalks$EXPIRATION_DATE, format="%m/%d/%Y")
sidewalks$APP_STATUS_DATE<-as.Date(sidewalks$APP_STATUS_DATE, format="%m/%d/%Y")
sidewalks$SUBMIT_DATE<-as.Date(sidewalks$SUBMIT_DATE, format="%m/%d/%Y")
Licenses by Status
The list of licenses includes active licenses, expired licenses, licenses for businesses that have closed (and are now inactive), licenses which are up for renewal as part of the two year process, or new requests for licenses. To better classify them, I created a new field called STATUS_CLASSIFICATION. Those licenses which are still active and not up for renewal are classified as “ACTIVE”. Those licenses that have been submitted for renewal (either because their expiration date is less than the latest application data, or that an active license is up for review) are classified as “RENEWAL”. Those licenses that are in the sheet but do not have a license number are classified as “NEW”, and the rest are marked as “OLD” to encompass inactive licenses that have not been acted upon.
sidewalks<-sidewalks %>% mutate(STATUS_CLASSIFICATION = ifelse(LIC_STATUS=="Active" & (APP_STATUS=="Application Approved" | APP_STATUS=="Application Review Completed"),"ACTIVE",ifelse(is.na(LICENSE_NBR),"NEW",ifelse((APP_STATUS_DATE>EXPIRATION_DATE | DPQA=="Issued Temp Op Letter") | (LIC_STATUS=="Active" & (APP_STATUS=="Pending Review" | APP_STATUS=="Submitted")),"RENEWAL","OLD"))))
Now that we have classified the status of the licenses, we are able to see how these classifications differ between the boroughs.
ggplot(sidewalks)+geom_mosaic(aes(x=product(STATUS_CLASSIFICATION,BOROUGH),fill=factor(STATUS_CLASSIFICATION)))+coord_flip()+labs(x="Borough",y="License Status", fill="License Designation")+ggtitle("Boroughs by License Status")
The mosaic plot shows how Bronx and Brooklyn may be getting more new license requests as a percentage of total licenses. Bronx is also getting the highest percentage of renewal requests out of its inactive and active licenses. We can also take a look at the license designations by borough:
ggplot(sidewalks)+geom_mosaic(aes(x=product(BOROUGH,STATUS_CLASSIFICATION),fill=factor(BOROUGH)))+coord_flip()+labs(x="License Status",y="Borough", fill="Borough")+ggtitle("License Status by Borough")
Looking at the data in this way, you can see how Brooklyn has the second-most new license requests, but how Manhattan still dominates in all license status categories.
Mapping Licenses
We can map the data to have a better view of where the datapoints lie. To get an overall picture, I selected a map centered on Long Island City in Queens so that we can get a good view of both Brooklyn and Bronx in addition to Manhattan.
map <- get_map( location = c(-73.9485424, 40.7454513), source = "google", zoom = 11, maptype="roadmap", color="bw")
Source : https://maps.googleapis.com/maps/api/staticmap?center=40.745451,-73.948542&zoom=11&size=640x640&scale=2&maptype=roadmap&language=en-EN
Plotting each of the restaurants colored by their borough. You can see how Manhattan dominates in the number of sidewalk cafes, and how the sidewalk cafes in Brooklyn and Queens are largely concentrated in the areas closer to Manhattan.
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.3)+ggtitle("Distribution of all licenses (inactive or active) in NYC")
Even with alpha, it is difficult to tell exactly where the largest concentrations of sidewalk cafes lie. Using a density plot, we can better see the concentration of sidewalk cafes around mid-to lower Manhattan, and the clusters in Astoria and Williamsburg.
g <- ggmap(map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks, geom="polygon", alpha=0.2)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in NYC")
The next question we can ask is whether there are clear patterns to where new license requests are coming in from, where they are being renewed, or where they have expired without renewal.
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=sidewalks, alpha=0.4)+facet_wrap(~STATUS_CLASSIFICATION)+ggtitle("Sidewalk Cafe Licenses by Status")
However, since we are zoomed out a lot and looking at the entire set of licenses, it is difficult to tell the difference between the distributions of license requests. For this analysis, I am only concerned about active licenses (whether they are being renewed or not), and new requests that have not yet been approved. To do this, I created another STATUS_CLASSIFICATION that groups active renewal requests into the Active category, and sets all other non-new requests as “inactive”.
sidewalks <- sidewalks %>% mutate(STATUS_CLASSIFICATION2=ifelse(STATUS_CLASSIFICATION=="ACTIVE" | (STATUS_CLASSIFICATION=="RENEWAL" & LIC_STATUS=="Active"), "ACTIVE", ifelse(STATUS_CLASSIFICATION=="NEW","NEW","OLD")))
new_active <- sidewalks %>% filter(STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW")
ggmap(map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active, alpha=0.3)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in New York City")
Since we are quite zoomed out, it is difficult to see what exactly is happening with the New requests. However, if we zoomed in, we would lose information about the Bronx or lower Brooklyn. In order to determine whether there is any useful information there, I have zoomed in on the Bronx region.
Geographic Analysis - Bronx
bronx_map <- get_map("Bronx, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
Source : https://maps.googleapis.com/maps/api/staticmap?center=Bronx,+NY&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN
Source : https://maps.googleapis.com/maps/api/geocode/json?address=Bronx%2C%20NY
ggmap(bronx_map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=BOROUGH),data=new_active)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in the Bronx")
Since there are very few active or new licenses in the Bronx, I feel comfortable in zooming in on the rest of Manhattan / Brooklyn and Queens in order to be able to better see what is happening at the expense of Bronx.
Geographic Analysis - Manhattan
I want to take a closer look at what is happening in Manhattan. In order to do this at a more granular level, I will use my merged data with our master_zip document which lists the neighborhoods of each Borough by zip code. I also pulled out the year of the submission of the license application and saved it as SUBMIT_YEAR.
master_zip<-read.csv("Data/zips_master_no_missing_nbrh.csv", strip.white=TRUE)
sidewalks_nbh <- merge(sidewalks, master_zip, by="ZIP", all.x=TRUE)
sidewalks_nbh <- sidewalks_nbh %>% mutate(SUBMIT_YEAR=year(SUBMIT_DATE))
sidewalks_manhattan_active <- sidewalks_nbh %>% filter(BOROUGH=="MANHATTAN" & (STATUS_CLASSIFICATION2=="ACTIVE" | STATUS_CLASSIFICATION2=="NEW"))
manhattan_map <- get_map("Manhattan, NY", source="google", maptype="roadmap", zoom=12, color="bw")
Source : https://maps.googleapis.com/maps/api/staticmap?center=Manhattan,+NY&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN
Source : https://maps.googleapis.com/maps/api/geocode/json?address=Manhattan%2C%20NY
I have also calculated the average locations of the neighborhoods in each borough based on their assigned Zip codes.
ggmap(manhattan_map)+geom_point(aes(x=LONGITUDE, y=LATITUDE, color=nbh), data=sidewalks_manhattan_active)+ggtitle("All active or new sidewalk cafes in Manhattan")
sidewalks_bronx<- sidewalks_nbh %>% filter(BOROUGH=="BRONX")
This view shows all of the distribution of all currently active or new sidewalk cafes in Manhattan. To get a better idea of density, we can also look at a heatmap of this view.
g <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active, geom="polygon", alpha=0.3)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in Manhattan")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)
This heatmap shows the highest concentration of sidewalk cafes in Greenwich Village and NoHo. To see the difference between active and new licenses, we can facet based on our previously calculated categorization.
g <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan_active, geom="polygon", alpha=0.3)+facet_wrap(~STATUS_CLASSIFICATION2)
g+scale_fill_gradient(low="yellow",high="red")+ggtitle("Heatmap of Sidewalk Cafe Licenses in Manhattan")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)
The heatmap view of the map is not a great one - it is difficult to tell the differences on the same scales in a heatmap. Looking at the same view, but by using geom_point again, we get the following view:
ggmap(manhattan_map)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color=nbh),data=sidewalks_manhattan_active)+facet_wrap(~STATUS_CLASSIFICATION2)+ggtitle("Active and New Licenses in Manhattan") + guides(color=FALSE)+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)
Once again, the data is not telling us too much since there are very few new requests. Perhaps a better way to categorize this data is to look at the years that applications were submitted, therefore ignoring whether a license is currently active or not.
year_bxp<-ggplot(sidewalks_nbh, aes(y=SUBMIT_YEAR,x=1))+geom_boxplot()+ggtitle("Boxplot of Submission Years")
year_bar<-ggplot(sidewalks_nbh,aes(x=SUBMIT_YEAR))+geom_bar()+ggtitle("Bar Graph of Submission Years")
grid.arrange(year_bxp, year_bar, ncol=2)
Since there are very few submissions between 2000 and 2014, I will be removing these outliers as potential mistakes where the submit dates were not updated once the licenses were renewed. Now we can look at a mosaic plot of the different neighborhoods and the years that the licenses were requested.
sidewalks_nbh<- sidewalks_nbh %>% filter(SUBMIT_YEAR>2014)
sidewalks_manhattan<- sidewalks_nbh %>% filter(BOROUGH=="MANHATTAN")
manhattan_counts <- sidewalks_manhattan %>% group_by(nbh, SUBMIT_YEAR) %>% summarise(count=n())
manhattan_counts$SUBMIT_YEAR<-as.factor(manhattan_counts$SUBMIT_YEAR)
manhattan_counts$SUBMIT_YEAR<-factor(manhattan_counts$SUBMIT_YEAR, c("2017", "2016","2015"))
ggplot(manhattan_counts, aes(x=reorder(nbh, count), y=count, fill=SUBMIT_YEAR))+geom_bar(stat="identity",position=position_dodge())+coord_flip()+xlab("Neighborhood")+ggtitle("License Requests per Neighborhood and Year")
It makes sense that 2016 would have more applications than 2015, since many of the 2015 applications have been renewed by now. It looks like Greenwich Village, Upper West Side, and Upper Easy Side consistently have the highest number of requests throughout the three years. However, to get a better understanding of the distribution of areas, we can create three separate heatmaps.
map2015 <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2015"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2015")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)
map2015
map2016<-ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2016"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2016")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)
map2016
map2017 <- ggmap(manhattan_map)+stat_density2d(aes(x=LONGITUDE,y=LATITUDE, fill=..level..),data=sidewalks_manhattan%>%filter(SUBMIT_YEAR=="2017"), geom="polygon", alpha=0.3)+scale_fill_gradient(low="yellow",high="red")+ggtitle("Manhattan Licenses Applied for in 2017")+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)
map2017
Across all three years, you can see how the concentrated area of licenses remains in the Greenwich Village area. However, it looks like the concentration of applicants in the Upper West Side in 2017 has dropped. This map is more useful than the bar chart because it shows the location of the cafes that may be straddling two neighborhoods - information that is not apparent from the grouped bar chart.
We can do the same sort of analysis in all four boroughs that we have sidewalk cafe information about. To make this analysis easier to view, I have created a shiny app which can be found by following this link: ShinyCafe
sidewalks_brooklyn <- sidewalks_nbh %>% filter(BOROUGH=="BROOKLYN")
sidewalks_queens <- sidewalks_nbh %>% filter(BOROUGH=="QUEENS")
sidewalks_bronx <- sidewalks_nbh %>% filter(BOROUGH=="BRONX")
brooklyn_map <- get_map("Brooklyn, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
Source : https://maps.googleapis.com/maps/api/staticmap?center=Brooklyn,+NY&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN
Source : https://maps.googleapis.com/maps/api/geocode/json?address=Brooklyn%2C%20NY
queens_map <- get_map("Astoria, NY", source = "google", zoom = 12, maptype="roadmap", color="bw")
Source : https://maps.googleapis.com/maps/api/staticmap?center=Astoria,+NY&zoom=12&size=640x640&scale=2&maptype=roadmap&language=en-EN
Source : https://maps.googleapis.com/maps/api/geocode/json?address=Astoria%2C%20NY
The Shiny app has a brief blurb explaining what each borough can tell us about the movement of sidewalk cafe applications.
Next, I added Adam’s income distribution ploygons to the background of these maps to see what the relationship is to income.
ggmap(manhattan_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_manhattan, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_manhattan, size=1)+ scale_fill_viridis()+geom_text_repel(data=manhattan_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(color=F)
Manhattan has a pretty clear clustering of sidewalk cafes in the higher income areas, with the exception of Lower Mahattan and the Financial District.
ggmap(brooklyn_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_brooklyn, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_brooklyn, size=1)+ scale_fill_viridis()+guides(color=FALSE)
The missing data in Williamsburg is hindering this analysis, but we do see how the average income in Brooklyn Heights, Park Slope, and Red Hook correlate with more sidewalk cafes.
ggmap(queens_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_queens, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_queens, size=1)+ scale_fill_viridis()+geom_text_repel(data=queens_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(color=F)
Astoria and Long Island City are interesting cases with relatively low average incomes but a high concentration of sidewalk cafes. The other area to the south-east, around Parkside and Forest Hills, has a high concentration of cafes and high income.
ggmap(bronx_map)+geom_polygon(aes(fill = income., x = long, y = lat, group = group), data = plotting_zips_bronx, alpha = 0.7)+geom_point(aes(x=LONGITUDE,y=LATITUDE, color="purple"),data=sidewalks_bronx, size=1)+ scale_fill_viridis()+geom_text_repel(data=bronx_names, aes(mean_lon, mean_lat, label=Neighborhood),fontface="bold", size=3)+guides(color=F)
With the very low number of Bronx datapoints, it is not easy to see a realtionship between income and where the sidewalk cafes are located.
summary_cafe <- merge(zip_sidewalk, mean_incomes, by=ZIP)
Error in fix.by(by.x, x) : object 'ZIP' not found
cafe.lm = lm(mean_income~n_cafes, data=summary_cafe)
summary(cafe.lm)$r.squared
[1] 0.1965228
On all the data, there is pretty weak correlation between income and number of sidewalk cafes, with an r-squares of 0.196.
Faceting by Borough, we can see how the relationship is actually negative in Bronx and Queens, but positive in Manhattan and Brooklyn (with the strongest relationship in Manhattan).